Divisive normalization is an efficient code for multivariate Pareto-distributed environments

Significance. Divisive normalization is a ubiquitous neural computation, commonly thought to implement the efficient coding principle. Despite empirical evidence that it reduces the statistical redundancy present in naturalistic stimuli, a precise characterization of the relationship between this neural code and stimulus statistics has remained elusive. This paper closes this gap by providing a necessary and sufficient condition for divisive normalization to generate an efficient code. The multivariate Pareto distribution found to be efficiently encoded exhibits many stylized features of naturalistic stimulus statistics and provides testable predictions. In an empirical analysis, we find that the Pareto distribution captures the statistics of natural images well, suggesting that divisive normalization may have evolved to efficiently represent stimuli from such distributions.


Supporting Information Text
Equation numbers without an "A" prefix refer to the main text.
Proof of Proposition 1. To see that $r(x) \in \Delta$ for all $x \in \mathbb{R}^n_+$ (non-negative real vectors), note that
\[
\sum_{i=1}^n \lambda_i r_i(x) = \gamma\,\frac{\sum_{i=1}^n \lambda_i x_i^\alpha}{b^\alpha + \sum_{j=1}^n \lambda_j x_j^\alpha} < \gamma .
\]
We next show that for all $y \in \Delta$ there is a unique $x \in \mathbb{R}^n_+$ such that $r(x) = y$, i.e., such that
\[
\frac{\gamma x_i^\alpha}{b^\alpha + \sum_{j=1}^n \lambda_j x_j^\alpha} = y_i \quad \text{for all } i.
\]
Letting $z_i = \lambda_i x_i^\alpha$, $w_i = \lambda_i y_i/\gamma$, and $\tilde b = b^\alpha$, this is equivalent to showing that there is a unique $z \in \mathbb{R}^n_+$ such that
\[
\frac{z_i}{\tilde b + \sum_{j=1}^n z_j} = w_i \quad \text{for all } i,
\]
or, equivalently, that the system of equations
\[
\left(I_n - w\mathbf{1}^{\mathsf T}\right) z = \tilde b\, w \qquad [\text{A.1}]
\]
has a unique solution $z \in \mathbb{R}^n_+$. The matrix determinant lemma (1, Theorem 18.1.1) states that for an invertible square matrix $A$ and column vectors $u$ and $v$,
\[
\det\!\left(A + uv^{\mathsf T}\right) = \det(A)\left(1 + v^{\mathsf T}A^{-1}u\right). \qquad [\text{A.2}]
\]
Setting $A = I_n$, $u = -w$, and $v = \mathbf{1}$, we then get
\[
\det\!\left(I_n - w\mathbf{1}^{\mathsf T}\right) = 1 - \sum_{i=1}^n w_i = 1 - \frac{1}{\gamma}\sum_{i=1}^n \lambda_i y_i > 0,
\]
since $\sum_i \lambda_i y_i < \gamma$ for all $y \in \Delta$. Therefore, the system of equations A.1 indeed has a unique solution, so $r$ is invertible and its image is $\Delta$.
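To illustrate the constructive nature of this argument, here is a short Python sketch (all parameter values, and the helper names r and r_inverse, are our own illustrative choices) that inverts $r$ by solving the linear system of Equation A.1:

```python
import numpy as np

# Illustrative parameters (arbitrary choices, not taken from the paper)
n, gamma, b, alpha = 4, 1.0, 0.5, 2.0
rng = np.random.default_rng(0)
lam = rng.uniform(0.5, 2.0, n)

def r(x):
    """Divisive normalization r_i(x) = gamma * x_i^alpha / (b^alpha + sum_j lam_j x_j^alpha)."""
    return gamma * x**alpha / (b**alpha + lam @ x**alpha)

def r_inverse(y):
    """Invert r by solving the linear system (I - w 1^T) z = b~ w of Eq. A.1."""
    w = lam * y / gamma                    # satisfies sum(w) < 1 exactly when y is in Delta
    b_tilde = b**alpha
    A = np.eye(n) - np.outer(w, np.ones(n))
    z = np.linalg.solve(A, b_tilde * w)    # unique solution since det(A) = 1 - sum(w) > 0
    return (z / lam)**(1.0 / alpha)        # x_i = (z_i / lam_i)^(1/alpha)

x = rng.uniform(0.1, 3.0, n)
y = r(x)
assert lam @ y < gamma                     # r(x) lies in Delta
print(np.allclose(r_inverse(y), x))        # True: r is inverted exactly
```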
Proof of Proposition 2. First note that (2, Eq. 8.48)
\[
I(X; g(X) + \varepsilon) = h\left(g(X) + \varepsilon\right) - h\left(g(X) + \varepsilon \mid X\right) = h\left(g(X) + \varepsilon\right) - h\left(\varepsilon \mid X\right) = h\left(g(X) + \varepsilon\right) - h(\varepsilon), \qquad [\text{A.3}]
\]
where the second equality uses the translation invariance of differential entropy (conditional on $X$, $g(X) + \varepsilon$ is a constant shift of $\varepsilon$) and the third uses the independence of $\varepsilon$ and $X$. It follows that maximizing $I(X; g(X) + \varepsilon)$ is equivalent to maximizing $h\left(g(X) + \varepsilon\right)$ (cf. 3, 4). For small noise, maximizing $I(X; g(X) + \varepsilon)$ is approximately equivalent to maximizing $h\left(g(X)\right)$. To see that this is an arbitrarily close approximation, note that by repeated application of the chain rule (2, Theorem 17.2.2 and Problem 2.),
\[
h\left(g(X) + \varepsilon\right) \ge h\left(g(X) + \varepsilon \mid \varepsilon\right) = h\left(g(X) \mid \varepsilon\right) = h\left(g(X)\right),
\]
where the last line uses independence of $g(X)$ and $\varepsilon$, which further implies (2, Lemma 17.2.1)
\[
I(X; g(X) + \varepsilon) = h\left(g(X) + \varepsilon\right) - h(\varepsilon) \ge h\left(g(X)\right) - h(\varepsilon),
\]
and thus $\left|I(X; g(X) + \varepsilon) - h\left(g(X)\right)\right| < \delta$ for any $\delta > 0$, as long as $|h(\varepsilon)| < \delta$.
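As a sanity check on Equation A.3, the following sketch evaluates both sides in a toy Gaussian setting of our own choosing, where every differential entropy involved has a closed form; it also illustrates that $I(X; g(X)+\varepsilon)$ is close to $h(g(X))$ when $h(\varepsilon) \approx 0$ and the noise is small relative to the signal:

```python
import numpy as np

# Toy Gaussian example (our choice, for illustration): X ~ N(0, s2), g = identity,
# independent noise eps ~ N(0, t2). All differential entropies are closed-form.
def h_gauss(var):
    return 0.5 * np.log(2 * np.pi * np.e * var)

s2 = 4.0                             # variance of g(X)
t2 = 1.0 / (2 * np.pi * np.e)        # noise variance chosen so that h(eps) = 0
I = h_gauss(s2 + t2) - h_gauss(t2)   # Eq. A.3: I(X; g(X)+eps) = h(g(X)+eps) - h(eps)
I_awgn = 0.5 * np.log(1 + s2 / t2)   # standard AWGN mutual information, as a cross-check
print(np.isclose(I, I_awgn))         # True
print(I - h_gauss(s2))               # small: I is close to h(g(X)) when h(eps) ~ 0
```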
Proof of Proposition 3. We want to show that the pdf $f_Y$ of Equation 3, which equivalently satisfies
\[
\log f_Y(y) = -c(y) - \log \int_C e^{-c(t)}\,dt \quad \text{for all } y \in C,
\]
maximizes $h(g) - \mathbb{E}_g[c(Y)]$ among all pdfs $g$ satisfying $g(y) \ge 0$ with equality for all $y \notin C$, and $\int_C g(y)\,dy = 1$.
Our proof parallels Theorem 12.1.1 of Cover and Thomas (2). Recall that the Kullback-Leibler divergence satisfies (2, Eq. 8.87)
\[
D\!\left(g \,\|\, f_Y\right) = \int_C g(y)\log\frac{g(y)}{f_Y(y)}\,dy \ge 0,
\]
where the inequality is strict unless $f_Y = g$ almost everywhere (2, Theorem 8.6.1). For any such pdf $g$ we have
\[
h(g) - \mathbb{E}_g[c(Y)] = -\int_C g\log g\,dy - \int_C g\,c\,dy = -D\!\left(g \,\|\, f_Y\right) - \int_C g\log f_Y\,dy - \int_C g\,c\,dy = -D\!\left(g \,\|\, f_Y\right) + \log\int_C e^{-c(t)}\,dt,
\]
where the last equality substitutes $\log f_Y(y) = -c(y) - \log\int_C e^{-c(t)}\,dt$. The upper bound $\log\int_C e^{-c(t)}\,dt$ is attained exactly when $D\!\left(g \,\|\, f_Y\right) = 0$. We have thus shown that the pdf $f_Y$ of Equation 3 achieves a strictly higher value of $h(g) - \mathbb{E}_g[c(Y)]$ than any other pdf with support $C$.
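A discretized numerical check may help build intuition; the grid size, cost draws, and tolerance below are our own illustrative choices, with a grid pmf standing in for a density on $C$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discretized stand-in for the continuous statement: on a fixed grid over the
# support C, the pmf p ~ exp(-c) maximizes H(p) - E_p[c].
m = 200
c = rng.uniform(0.0, 3.0, m)           # arbitrary cost values on the grid points
p_star = np.exp(-c) / np.exp(-c).sum()

def objective(p):
    return -(p * np.log(p)).sum() - (p * c).sum()

best = objective(p_star)
for _ in range(1000):                  # random competing pmfs on the same support
    q = rng.dirichlet(np.ones(m))
    assert objective(q) <= best + 1e-12
print("p ~ exp(-c) attained the maximum:", best)
```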
Proof of Theorem 1. Since the function $r$ is invertible and has continuous derivatives, a change of variables implies that
\[
f_Y(y) = \frac{f_X\!\left(r^{-1}(y)\right)}{\left|\det J_r\!\left(r^{-1}(y)\right)\right|}
\]
whenever the determinant of the Jacobian $J_r(x)$ of $r$ is non-zero (5, Theorem 8.1.7). We first compute the Jacobian of $\tilde r$, where $\tilde r_i(x) = \gamma x_i / \left(\tilde b + \sum_{j=1}^n \lambda_j x_j\right)$ and $\tilde b = b^\alpha$. Writing $d(x) = \tilde b + \sum_{j=1}^n \lambda_j x_j$, we have
\[
J_{\tilde r}(x) = \frac{\gamma}{d(x)^2}\left(d(x)\, I_n - x\lambda^{\mathsf T}\right),
\]
where $I_n$ is the $n \times n$ identity matrix. Using the matrix determinant lemma (Eq. A.2), with $A = d(x)\cdot I_n$, $u = -x$, and $v = \lambda$, we obtain
\[
\det J_{\tilde r}(x) = \left(\frac{\gamma}{d(x)^2}\right)^{n} d(x)^n \left(1 - \frac{\lambda^{\mathsf T}x}{d(x)}\right) = \frac{\gamma^n\,\tilde b}{d(x)^{n+1}}.
\]
In order to find the Jacobian of $r(x) = \tilde r(x^\alpha)$, note that by the multivariate chain rule
\[
J_r(x) = J_{\tilde r}(x^\alpha)\,\mathrm{diag}\!\left(\alpha x_1^{\alpha-1}, \ldots, \alpha x_n^{\alpha-1}\right),
\]
so that
\[
\det J_r(x) = \frac{\gamma^n \alpha^n b^\alpha \prod_{i=1}^n x_i^{\alpha-1}}{\left(b^\alpha + \sum_{j=1}^n \lambda_j x_j^\alpha\right)^{n+1}}.
\]
Since $\det\left(J_r(x)\right) > 0$ for all $x \in \mathbb{R}^n_{++}$ we then have
\[
f_Y\!\left(r(x)\right) = \frac{f_X(x)}{\det J_r(x)}
\]
for any positive vector $x \in \mathbb{R}^n_{++}$.
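The closed-form determinant can be verified against a finite-difference Jacobian; the parameters below are arbitrary illustrative values:

```python
import numpy as np

# Check the closed-form Jacobian determinant of r against central finite differences.
n, gamma, b, alpha = 3, 1.5, 0.7, 1.8
rng = np.random.default_rng(2)
lam = rng.uniform(0.5, 2.0, n)

def r(x):
    return gamma * x**alpha / (b**alpha + lam @ x**alpha)

def det_jacobian_closed_form(x):
    d = b**alpha + lam @ x**alpha
    return gamma**n * alpha**n * b**alpha * np.prod(x**(alpha - 1)) / d**(n + 1)

x = rng.uniform(0.5, 2.0, n)
eps = 1e-6
J = np.empty((n, n))
for j in range(n):                     # numerical Jacobian, column by column
    e = np.zeros(n); e[j] = eps
    J[:, j] = (r(x + e) - r(x - e)) / (2 * eps)
print(np.isclose(np.linalg.det(J), det_jacobian_closed_form(x)))  # True
```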

Proof of Theorem 2. Note that the support of $r(X)$ is the image of $r$ which, by Proposition 1, is given by $\Delta$. From Theorem 1 combined with Proposition 3 it follows that
\[
f_X(x) = f_Y\!\left(r(x)\right)\det J_r(x) = \frac{e^{-c(r(x))}}{\int_\Delta e^{-c(t)}\,dt}\cdot\frac{\gamma^n \alpha^n b^\alpha \prod_{i=1}^n x_i^{\alpha-1}}{\left(b^\alpha + \sum_{j=1}^n \lambda_j x_j^\alpha\right)^{n+1}}
\]
is a necessary and sufficient condition for $r$ to maximize the entropy of the output distribution net of expected costs over all representations with support $\Delta$. Setting $\alpha \equiv \beta$ and $b/\lambda_i^{1/\alpha} \equiv \sigma_i$, this becomes
\[
f_X(x) \propto \prod_{i=1}^n \frac{1}{\sigma_i}\left(\frac{x_i}{\sigma_i}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{x_j}{\sigma_j}\right)^{\beta}\right)^{-(n+1)} e^{-c(r(x))}.
\]
The result follows from the fact that for any constant translation $\mu \in \mathbb{R}^n$, $S = X + \mu$ has pdf $f_S(s) = f_X(s - \mu)$, so that
\[
f_S(s) \propto \prod_{i=1}^n \frac{1}{\sigma_i}\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{-(n+1)} e^{-c(r(s - \mu))} \quad \forall s > \mu. \qquad [5]
\]

Proof of Theorem 3. In the special case of constant costs $c(y) = \bar c$ for all $y \in \Delta$, the factor $e^{-c(r(s-\mu))}$ is constant, so Equation 5 of Theorem 2 reduces to
\[
f_S(s; \mu, \sigma, \beta) = n!\,\beta^n \prod_{i=1}^n \frac{1}{\sigma_i}\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{-(n+1)}, \quad s > \mu. \qquad [7]
\]
It remains to show that Equation 7 is the pdf of a multivariate Pareto type III distribution with joint survival function (7, Eq. 6.1.17 with $\gamma_i = 1/\beta$ for $i = 1, \ldots, n$),
\[
\bar F_S(s; \mu, \sigma, \beta) = \left(1 + \sum_{i=1}^n\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta}\right)^{-1}, \quad s > \mu. \qquad [6]
\]
To see this, define
\[
\bar F^{(k)}(s) = \frac{\partial^k}{\partial s_1 \cdots \partial s_k}\,\bar F_S(s; \mu, \sigma, \beta)
\]
and note that, for $k = 0, 1, 2, \ldots, n$,
\[
\bar F^{(k)}(s) = (-1)^k\, k!\,\beta^k \prod_{i=1}^k \frac{1}{\sigma_i}\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{-(k+1)}. \qquad [\text{A.4}]
\]
This can be shown by induction. Equation A.4 holds trivially for $k = 0$. Note that differentiating $\bar F^{(k-1)}$ with respect to $s_k$ leaves the product over $i = 1, \ldots, k-1$ unchanged and contributes a factor
\[
-k\,\beta\,\frac{1}{\sigma_k}\left(\frac{s_k - \mu_k}{\sigma_k}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{-1},
\]
so that Equation A.4 also holds for $k$. Next, note that, given $s_1, \ldots, s_n$ and denoting, for any $i$, the event $S_i > s_i$ by $A_i$ and its complement by $A_i^c$, the cdf is obtained from the survival function as
\[
F_S(s_1, \ldots, s_n) = P\left(S_1 \le s_1, \ldots, S_n \le s_n\right) = P\Big(\bigcap_{i=1}^n A_i^c\Big) = 1 - P\Big(\bigcup_{i=1}^n A_i\Big) = \sum_{T \subseteq \{1, \ldots, n\}} (-1)^{|T|}\, P\Big(\bigcap_{i \in T} A_i\Big),
\]
where the fourth equality follows from the probabilistic version of the inclusion-exclusion principle. Only the term with $T = \{1, \ldots, n\}$ depends on all of $s_1, \ldots, s_n$, therefore
\[
\frac{\partial^n}{\partial s_1 \cdots \partial s_n} F_S(s_1, \ldots, s_n) = (-1)^n\, \frac{\partial^n}{\partial s_1 \cdots \partial s_n} P\left(A_1 \cap \cdots \cap A_n\right) = (-1)^n\, \bar F^{(n)}(s_1, \ldots, s_n),
\]
so, using Equation A.4 with $k = n$, we find that the pdf associated with the survival function of Equation 6 is given by
\[
f_S(s_1, \ldots, s_n; \mu, \sigma, \beta) = \frac{\partial^n}{\partial s_1 \cdots \partial s_n} F_S(s_1, \ldots, s_n; \mu, \sigma, \beta) = n!\,\beta^n \prod_{i=1}^n \frac{1}{\sigma_i}\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta-1}\left(1 + \sum_{j=1}^n\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{-(n+1)},
\]
which indeed coincides with Equation 7.
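The key step, Equation A.4 with $k = n$, can be checked symbolically; the sketch below does so for $n = 2$, taking $\mu = 0$ and $\sigma = 1$ for brevity:

```python
import sympy as sp

# Symbolic check for n = 2 (mu = 0, sigma = 1): differentiating the survival
# function of Eq. 6 once in each coordinate and multiplying by (-1)^n
# recovers the pdf of Eq. 7.
s1, s2, beta = sp.symbols("s1 s2 beta", positive=True)
survival = 1 / (1 + s1**beta + s2**beta)                  # Eq. 6 with n = 2
pdf = (-1)**2 * sp.diff(survival, s1, s2)                 # (-1)^n * F-bar^(n)
pdf_eq7 = (sp.factorial(2) * beta**2 * s1**(beta - 1) * s2**(beta - 1)
           / (1 + s1**beta + s2**beta)**3)                # Eq. 7 with n = 2
print(sp.simplify(pdf - pdf_eq7))                         # prints 0
```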

Derivation of the Marginal Distribution (Equation 8). The survival function of the marginal distribution is
\[
\bar F_{S_i}(s_i; \mu_i, \sigma_i, \beta) = \left(1 + \left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta}\right)^{-1}, \quad s_i > \mu_i,
\]
so its cdf is obtained as
\[
F_{S_i}(s_i; \mu_i, \sigma_i, \beta) = 1 - \left(1 + \left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta}\right)^{-1}
\]
and its pdf is
\[
f_{S_i}(s_i; \mu_i, \sigma_i, \beta) = \frac{\beta}{\sigma_i}\left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta-1}\left(1 + \left(\frac{s_i - \mu_i}{\sigma_i}\right)^{\beta}\right)^{-2}, \qquad [8]
\]
which is a univariate Pareto type III distribution. Its mode, for $\beta > 1$, is (7, Eq. 3.3.4)
\[
\mu_i + \sigma_i\left(\frac{\beta - 1}{\beta + 1}\right)^{1/\beta}.
\]
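As a quick numerical check of the marginal pdf and its mode (parameter values are our own illustrations):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Check the marginal Pareto III pdf (Eq. 8) and its mode formula.
mu, sigma, beta = 0.5, 2.0, 3.0       # illustrative values

def pdf(s):
    z = (s - mu) / sigma
    return (beta / sigma) * z**(beta - 1) * (1 + z**beta)**(-2)

total, _ = quad(pdf, mu, np.inf)
print(np.isclose(total, 1.0))          # pdf integrates to one

res = minimize_scalar(lambda s: -pdf(s), bounds=(mu, mu + 10 * sigma), method="bounded")
mode_formula = mu + sigma * ((beta - 1) / (beta + 1))**(1 / beta)
print(np.isclose(res.x, mode_formula, atol=1e-4))  # matches (7, Eq. 3.3.4)
```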

Derivation of Moments (Including Equations 9 and 11). From (7, Eq. 6.1.27) it follows, with $\bar\alpha = 1$ and $\gamma_i = 1/\beta$ and using $\Gamma(1) = 1$, that for $\beta > 1$ the mean is given by
\[
\mathbb{E}[S_i] = \mu_i + \sigma_i\,\Gamma\!\left(1 + \tfrac{1}{\beta}\right)\Gamma\!\left(1 - \tfrac{1}{\beta}\right) = \mu_i + \sigma_i\,\frac{\pi/\beta}{\sin(\pi/\beta)}, \qquad [9]
\]
where $\Gamma$ is the Gamma function and the second equality follows from its recursive expression $\Gamma(1 + z) = z\,\Gamma(z)$ and Euler's reflection formula, which together imply that
\[
\Gamma(1 + z)\,\Gamma(1 - z) = \frac{\pi z}{\sin(\pi z)}
\]
for all $z \notin \mathbb{Z}$. From (7, Eq. 6.1.22) it follows that the conditional mean, for $\beta > 1/n$, is given by
\[
\mathbb{E}\left[S_i \mid S_j = s_j,\ j \ne i\right] = \mu_i + \sigma_i\left(1 + \sum_{j \ne i}\left(\frac{s_j - \mu_j}{\sigma_j}\right)^{\beta}\right)^{1/\beta}\frac{\Gamma\!\left(n - \tfrac{1}{\beta}\right)\Gamma\!\left(1 + \tfrac{1}{\beta}\right)}{\Gamma(n)}.
\]
From (7, Eq. 3.3.11) it follows that, for $\beta > 2$, the variance of the Pareto type III distribution is¶
\[
\mathrm{Var}(S_i) = \sigma_i^2\left[\Gamma\!\left(1 + \tfrac{2}{\beta}\right)\Gamma\!\left(1 - \tfrac{2}{\beta}\right) - \Gamma\!\left(1 + \tfrac{1}{\beta}\right)^2\Gamma\!\left(1 - \tfrac{1}{\beta}\right)^2\right].
\]
The covariance (Equation 11) is obtained (7, Eq. 6.1.29), for $\beta > 2$, as
\[
\mathrm{Cov}(S_i, S_j) = \sigma_i\sigma_j\,\Gamma\!\left(1 + \tfrac{1}{\beta}\right)^2\left[\Gamma\!\left(1 - \tfrac{2}{\beta}\right) - \Gamma\!\left(1 - \tfrac{1}{\beta}\right)^2\right], \qquad [11]
\]
so that the correlation coefficient is
\[
\rho(S_i, S_j) = \frac{\Gamma\!\left(1 + \tfrac{1}{\beta}\right)^2\left[\Gamma\!\left(1 - \tfrac{2}{\beta}\right) - \Gamma\!\left(1 - \tfrac{1}{\beta}\right)^2\right]}{\Gamma\!\left(1 + \tfrac{2}{\beta}\right)\Gamma\!\left(1 - \tfrac{2}{\beta}\right) - \Gamma\!\left(1 + \tfrac{1}{\beta}\right)^2\Gamma\!\left(1 - \tfrac{1}{\beta}\right)^2}.
\]

¶ A minus sign appears to be missing in (7, Eq. 6.1.28).
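These moment formulas can be verified by Monte Carlo. The sampler below uses a scale-mixture representation: conditional on $V \sim \mathrm{Exp}(1)$, setting $S_i = \mu_i + \sigma_i (E_i/V)^{1/\beta}$ with $E_i$ i.i.d. $\mathrm{Exp}(1)$ reproduces the joint survival function of Equation 6 after marginalizing over $V$. The representation is our own device for this check, and all parameter values are illustrative:

```python
import numpy as np
from scipy.special import gamma as G

# Monte Carlo check of the mean (Eq. 9) and covariance (Eq. 11) formulas via
# the scale-mixture sampler S_i = mu_i + sigma_i * (E_i / V)**(1/beta),
# V ~ Exp(1), E_i i.i.d. Exp(1).
rng = np.random.default_rng(3)
n, beta = 2, 4.0
mu, sigma = np.array([0.0, 1.0]), np.array([1.0, 2.0])

N = 2_000_000
V = rng.exponential(size=(N, 1))
E = rng.exponential(size=(N, n))
S = mu + sigma * (E / V)**(1 / beta)

mean_formula = mu + sigma * (np.pi / beta) / np.sin(np.pi / beta)
cov_formula = (sigma[0] * sigma[1] * G(1 + 1/beta)**2
               * (G(1 - 2/beta) - G(1 - 1/beta)**2))
print(S.mean(axis=0), mean_formula)                  # close for beta > 1
print(np.cov(S[:, 0], S[:, 1])[0, 1], cov_formula)   # close for beta > 2
```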

Derivation of the Conditional Variance for $\beta = 1$ (Equation 12). The variance of a univariate Pareto type II distribution with cumulative distribution function (7, Eq. 3.2.2)
\[
F(s; \mu, \sigma, \bar\alpha) = 1 - \left(1 + \frac{s - \mu}{\sigma}\right)^{-\bar\alpha}, \quad s > \mu,
\]
is, for any $\bar\alpha > 2$, given by (7, Eq. 3.3.13)
\[
\mathrm{Var}(S) = \frac{\sigma^2\,\bar\alpha}{(\bar\alpha - 1)^2(\bar\alpha - 2)}.
\]
From Equation 13 (see below) we know that the conditional distribution of a multivariate Pareto III distribution with $\beta = 1$ is a univariate Pareto type II distribution with location $\mu = \mu_i$, scale $\sigma = \sigma_i\left(1 + \sum_{j \ne i}\frac{s_j - \mu_j}{\sigma_j}\right)$, and shape parameter $\bar\alpha = n$.
For any $n > 2$ we thus have
\[
\mathrm{Var}\left(S_i \mid S_j = s_j,\ j \ne i\right) = \sigma_i^2\left(1 + \sum_{j \ne i}\frac{s_j - \mu_j}{\sigma_j}\right)^{2}\frac{n}{(n - 1)^2(n - 2)}, \qquad [12]
\]
where we have used the Pareto type II variance formula above with the conditional scale and shape parameters.
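The Pareto type II variance formula used here can be checked by inverse-cdf sampling (values below are illustrative):

```python
import numpy as np

# Check of the Pareto II variance formula by inverse-cdf sampling: if
# U ~ Uniform(0,1), then s = mu + sigma * ((1 - U)**(-1/a) - 1) has cdf
# 1 - (1 + (s - mu)/sigma)**(-a). Illustrative shape a = n = 5.
rng = np.random.default_rng(4)
mu, sigma, a = 0.0, 1.5, 5.0

U = rng.uniform(size=5_000_000)
s = mu + sigma * ((1 - U)**(-1 / a) - 1)

var_formula = sigma**2 * a / ((a - 1)**2 * (a - 2))
print(s.var(), var_formula)   # close for a > 2
```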

Derivation of the Conditional Distribution (Equation 13). From (7, Eq. 6.1.20), with $\bar\alpha = 1$ and $\gamma_i = 1/\beta$, it follows that the conditional distribution of the multivariate Pareto type III distribution of Equation 6 is a univariate Pareto type IV distribution, where the cdf of a univariate Pareto type IV distribution is (7, Eq. 3.2.8)
\[
F(s; \mu, \sigma, \gamma, \bar\alpha) = 1 - \left(1 + \left(\frac{s - \mu}{\sigma}\right)^{1/\gamma}\right)^{-\bar\alpha}, \quad s > \mu.
\]
This implies that the conditional distribution has cdf, for $s_i > \mu_i$,
\[
F_{S_i \mid S_j = s_j,\ j \ne i}(s_i) = 1 - \left(1 + \frac{\left((s_i - \mu_i)/\sigma_i\right)^{\beta}}{1 + \sum_{j \ne i}\left((s_j - \mu_j)/\sigma_j\right)^{\beta}}\right)^{-n}, \qquad [13]
\]
i.e., a Pareto type IV distribution with location $\mu_i$, scale $\sigma_i\left(1 + \sum_{j \ne i}\left((s_j - \mu_j)/\sigma_j\right)^{\beta}\right)^{1/\beta}$, inequality parameter $\gamma = 1/\beta$, and shape parameter $\bar\alpha = n$.
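Finally, Equation 13 can be verified numerically for $n = 2$ by integrating the joint pdf of Equation 7 (the conditioning value and parameters below are illustrative):

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of Eq. 13 for n = 2: the conditional cdf of S_1 given
# S_2 = s2, computed by integrating the joint pdf (Eq. 7), matches the
# Pareto IV form with shape n = 2.
mu = np.array([0.0, 0.5]); sigma = np.array([1.0, 2.0]); beta = 2.5
s2 = 1.7                                    # conditioning value, arbitrary

def joint_pdf(x1, x2):
    z = (np.array([x1, x2]) - mu) / sigma
    return (2 * beta**2 * np.prod(z**(beta - 1) / sigma)
            / (1 + (z**beta).sum())**3)

def cond_cdf_numeric(s1):
    num, _ = quad(lambda t: joint_pdf(t, s2), mu[0], s1)
    den, _ = quad(lambda t: joint_pdf(t, s2), mu[0], np.inf)
    return num / den

def cond_cdf_eq13(s1):
    c = 1 + ((s2 - mu[1]) / sigma[1])**beta
    return 1 - (1 + ((s1 - mu[0]) / sigma[0])**beta / c)**(-2)

for s1 in [0.3, 1.0, 2.5]:
    print(np.isclose(cond_cdf_numeric(s1), cond_cdf_eq13(s1), atol=1e-6))
```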